Text Segmentation of Digitized Clinical Texts
Abstract
In this paper, we present the experiments we conducted to recover the original two-column page layout from digitized files whose layout was damaged. We designed several CRF-based approaches, either to identify the column separator or to classify each token of each line as belonging to the left or the right column. We achieved our best results with a model trained on homogeneous corpora (only files composed of 2 columns) when classifying each token into the left or right column (overall F-measure of 0.968). Our experiments show that it is possible to recover the original column layout of digitized documents with high-quality results.
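As an illustration of the token-classification setting described in the abstract, the following minimal sketch shows how per-token left/right labels could be turned back into two columns of text. It assumes the labels have already been predicted (for example by a CRF); the function names `split_columns` and `restore_reading_order` and the label values `"L"` and `"R"` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Assumption: each token is a (text, label) pair, where the label is
# "L" (left column) or "R" (right column), e.g. as predicted by a CRF.
Token = Tuple[str, str]


def split_columns(lines: List[List[Token]]) -> Tuple[List[str], List[str]]:
    """Rebuild the left and right columns from labelled physical lines."""
    left: List[str] = []
    right: List[str] = []
    for line in lines:
        left_words = [text for text, label in line if label == "L"]
        right_words = [text for text, label in line if label == "R"]
        # A physical line may contribute to only one of the two columns.
        if left_words:
            left.append(" ".join(left_words))
        if right_words:
            right.append(" ".join(right_words))
    return left, right


def restore_reading_order(lines: List[List[Token]]) -> str:
    """Emit the whole left column first, then the whole right column."""
    left, right = split_columns(lines)
    return "\n".join(left + right)
```

Given two damaged lines in which left-column and right-column tokens are interleaved, `restore_reading_order` returns the left-column text followed by the right-column text, which is the reading order a two-column page implies.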
Similar resources
On the unsupervised analysis of domain-specific Chinese texts.
With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a larg...
Studies for Segmentation of Historical Texts: Sentences or Chunks?
We present some experiments on text segmentation for German texts aimed at developing a method of segmenting historical texts. Since such texts have no (consistent) punctuation, we use a machine learning approach to label tokens with their relative positions in text segments using Conditional Random Fields. We compare the performance of this approach on the task of segmenting of text into sente...
A Online Appendix: Additional Text Preparation Details
Text gathered from a variety of different sources may not be immediately readable by a computer. First, the data itself might be pictures of text taken from an archive or hand-written manuscripts that are not yet digitized. In these cases, Optical Character Recognition (OCR) technologies may be required. Even if they have been digitized, texts may be stored digitally using different encodings a...
A Dynamic Programming Algorithm for the Segmentation of Greek Texts
In this paper we introduce a dynamic programming algorithm to perform linear text segmentation by global minimization of a segmentation cost function which consists of: (a) within-segment word similarity and (b) prior information about segment length. The evaluation of the segmentation accuracy of the algorithm on a text collection consisting of Greek texts showed that the algorithm achieves hi...
Managing and Annotating Historical Multimodal Corpora with the eHumanities Desktop: An outline of the current state of the LOEWE project ’Illustrations of Goethe’s Faust’
Text corpora are structured sets of text segments that can be annotated or interrelated. Expanding on this, we can define a database of images as an iconographic multimodal corpus with annotated images and the relations between images as well as between images and texts. The Goethe-Museum in Frankfurt holds a significant collection of art work and texts relating to Goethe’s Faust from the early...